{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import vaex\n",
    "import numpy as np\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the example dataset\n",
    "# df = vaex.example()\n",
    "\n",
    "# or downloads a slightly larger version of the example dataset\n",
    "df = vaex.datasets.helmi_de_zeeuw.fetch()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Expressions\n",
    "Expressions are only evaluated when needed by vaex, and save you memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sqrt(df.x**2 + df.y**2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Virtual columns\n",
    "Expression can be added to a DataFrame to create a virtual column. A virtual column can be treated the same as a normal column, except it does not use up RAM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['r'] = np.sqrt(df.x**2 + df.y**2)\n",
    "df[['x', 'y', 'r']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.r.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# JIT (Just in time) compilation\n",
    "If an expression becomes to show, try optimizing it with numba, or Pythran"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['r_normal'] = np.sqrt(df.x**2 + df.y**2)\n",
    "df['r_jit'] = np.sqrt(df.x**2 + df.y**2).jit_numba()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit -n3 -r10\n",
    "df.mean(df.r_normal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit -n3 -r10\n",
    "df.mean(df.r_jit)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Materialize\n",
    "Or, if you have plenty of RAM, materialize the column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_m = df.materialize('r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit -n3 -r10\n",
    "df_m.mean(df.r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filtering\n",
    "Filtering makes no copy of the data, ideal when exploring your 1TB dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_filtered = df[df.x > 0]\n",
    "df_filtered[['x', 'y', 'r']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Selections\n",
    "All statistical functions can take 1 or more selections as arguments. Multiple selections allow for multiple computations in 1 pass over the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.mean(df.x, selection=[df.x < 0, df.x > 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data cleansing\n",
    "Even fillna does not use memory, try different values without wasting time or RAM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_fillna_0 = df.fillna(value=0, column_names=['x'])\n",
    "df_fillna_3 = df.fillna(value=3, column_names=['x'])\n",
    "df_fillna_5 = df.fillna(value=5, column_names=['x'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# N-d statistics\n",
    "All statistical methods can be computed on N-dimensional regular grids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.mean(df.x, binby=df.y, limits=[-10, 10], shape=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualization\n",
    "The N-d statistics are the basis for many of the build-in visualizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot1d(df.x, limits=[-10, 10]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(df.x, df.y, limits=[-10, 10]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Interactive viz\n",
    "Based on ipywidgets / bqplot, you can even do interactive visualization\n",
    "\n",
    "*Note that (since we are on mybinder) we only use 100.000 rows, instead of 150.000.000 or >1.000.000.000 rows. Download it from https://docs.vaex.io/en/latest/datasets.html if you want to try it out on your local computer.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the first 100,000 rows \n",
    "df_taxi = vaex.open('./nyc_taxi_2015_100k.arrow')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_taxi.plot_widget(df_taxi.dropoff_longitude, df_taxi.dropoff_latitude, shape=400,\n",
    "                    f='log1p', controls_selection=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
